Indirect Spatial Data Extraction from Web Documents
Authors
Abstract
The paper outlines an approach to indirect spatial data extraction that learns restricted finite-state automata from web documents written in Bulgarian. It uses heuristics to generalize an initial finite-state automaton, which recognizes only the positive examples, into an automaton that recognizes as large a language as possible without extracting any non-positive examples from the training data set. The learning method, its software implementation, and experiments are presented. The investigation follows the rules of the EU INSPIRE Network.
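The generalization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' method: the paper's heuristics and its restricted automaton class are not given here, so the sketch substitutes plain greedy state merging on a prefix tree acceptor. The token classes `CITY`, `STREET`, and `NUM` and all function names are hypothetical. The sketch starts from an automaton that accepts exactly the positive examples and keeps a merge only if no negative training example becomes accepted.

```python
from itertools import combinations


def build_pta(positives):
    """Build a prefix tree acceptor that accepts exactly the positive examples."""
    trans = {}        # (state, symbol) -> next state
    accepting = set()
    n_states = 1      # state 0 is the start state
    for word in positives:
        state = 0
        for sym in word:
            if (state, sym) not in trans:
                trans[(state, sym)] = n_states
                n_states += 1
            state = trans[(state, sym)]
        accepting.add(state)
    edges = frozenset((s, a, d) for (s, a), d in trans.items())
    return 0, edges, frozenset(accepting)


def states_of(nfa):
    """Return the set of states mentioned anywhere in the automaton."""
    start, edges, accepting = nfa
    return {start} | set(accepting) | {q for s, _, d in edges for q in (s, d)}


def accepts(nfa, word):
    """Simulate the (possibly nondeterministic) automaton on a token sequence."""
    start, edges, accepting = nfa
    current = {start}
    for sym in word:
        current = {d for (s, a, d) in edges if s in current and a == sym}
        if not current:
            return False
    return bool(current & accepting)


def merge(nfa, keep, drop):
    """Merge state `drop` into state `keep`; the language can only grow."""
    start, edges, accepting = nfa

    def rename(q):
        return keep if q == drop else q

    return (
        rename(start),
        frozenset((rename(s), a, rename(d)) for s, a, d in edges),
        frozenset(rename(q) for q in accepting),
    )


def generalize(positives, negatives):
    """Greedily merge states while no negative example becomes accepted."""
    nfa = build_pta(positives)
    changed = True
    while changed:
        changed = False
        for keep, drop in combinations(sorted(states_of(nfa)), 2):
            candidate = merge(nfa, keep, drop)
            if not any(accepts(candidate, w) for w in negatives):
                nfa = candidate   # keep the more general automaton
                changed = True
                break             # state set changed, restart the pair scan
    return nfa


# Hypothetical training data: sequences of token classes from address-like text.
positives = [("CITY", "STREET", "NUM"), ("CITY", "STREET", "STREET", "NUM")]
negatives = [("CITY", "NUM"), ("STREET", "NUM"), ("CITY",)]
learned = generalize(positives, negatives)
```

Because merging states never removes an accepted string, the learned automaton still accepts every positive example, and the check inside `generalize` guarantees that it rejects every negative one. Which unseen strings it additionally accepts depends on the merge order, which is where domain heuristics such as those the abstract mentions would come in.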
Similar resources
SXPath - Extending XPath towards Spatial Querying on Web Documents
Querying data from presentation formats like HTML, for purposes such as information extraction, requires the consideration of tree structures as well as the consideration of spatial relationships between laid out elements. The underlying rationale is that frequently the rendering of tree structures is very involved and undergoing more frequent updates than the resulting layout structure. Theref...
Entity ranking using click-log information
Log information describing the items the users have selected from the set of answers a query engine returns to their queries constitute an excellent form of indirect user feedback that has been extensively used in the web to improve the effectiveness of search engines. In this work we study how the logs can be exploited to improve the ranking of the results returned by an entity search engine. ...
Knowledge Extraction from Web Documents Using Self-Organizing Neural Networks
Knowledge discovery is defined as non-trivial extraction of implicit, previously unknown and potentially useful information from given data [1]. Knowledge extraction from web documents deals with unstructured, free-format documents whose number is enormous and rapidly growing.
Landmark Extraction: A Web Mining Approach
Landmarks play crucial roles in human geographic knowledge. There has been much work focusing on the extraction of landmarks from geographic information systems (GIS) or 3D city models. The extraction of landmarks from digital documents, however, has not been fully explored. The World Wide Web provides a rich source of region related information based on our understanding of geographic space. W...
OLERA: On-Line Extraction Rule Analysis for Semi-structured Documents
The vast amount of online information available has led to renewed interest in information extraction (IE) systems that analyze input documents to produce a structured representation of selected information from the documents. Information extraction from semistructured documents has been studied extensively recently. Most researches focus on supervised learning approaches where targets must be ...